I. INTRODUCTION


Farmers once used oxen to plow their fields. And when the task got too big for
one ox they did not try to grow a bigger ox. They got two of them. [Ref. 1]


A. THE NEED FOR PARALLEL PROCESSING 

So too have we often found that one computer is not enough, or at least, not fast 
enough for many applications. While progress on producing faster single processor 
computers continues, it is the orders of magnitude leap in speed possible in 
multiple-processor computers that promises to lead computing into its Fifth
Generation.


Multiple-processor computers became necessary because a limit to higher speed
had been reached with brute-force approaches employing faster switching devices.
Faster components made with gallium arsenide or Josephson junction devices can
increase computer speed only 10 times if current uniprocessor architectures are
used; however, with the new architectures, there is hope of increasing speed 100 to
1000 times. [Ref. 2]


Such dramatic increases in computer speed would be of great benefit to 
researchers working on computationally-intensive and/or real time problems such as 
adaptive antenna control, weather prediction, or fusion reactor design. It is not merely
a question of having the answers in seconds instead of minutes--once machines can 
perform calculations in real time, whole new applications suddenly become possible. 

As an example, consider a computer system which calculates the power spectral 
density of intercepted radar emitters. A system which takes an hour to analyze a few 
seconds’ worth of data may be useful to compile electronic intelligence data back at 
fleet headquarters--it produces answers long after the event is over. However, if the 
system could perform its analysis in real time it could be used onboard ship or in an 
aircraft to recognize hostile missile seekers and dispense chaff or activate jammers--that 
is, to respond to events as they happen. Increased speed alone could make this new
application possible.
B. PARALLEL PROCESSORS DEPEND ON COMMUNICATION

When using a number of processors on a single problem, the exchange of data
between processors becomes a critical bottleneck. [Ref. 3]
Extensive research has already been conducted in many areas related to parallel 
processing, such as task distribution and software development. The research reported 
in this paper focused on the architecture of parallel-processing systems, especially with 
regard to inter-processor communications.
A system which uses more than one processor to perform a task must provide 
communication paths between the processors. There are essentially two approaches to 
this requirement: 


• provide a path from every processor to every other processor--“exhaustive”
  communications

• provide paths between each processor and only some of the other
  processors--“limited” communications.

Figure 1.1 Exhaustive Communications.
1. Exhaustive Communications
An exhaustive communication architecture (Figure 1.1) provides direct data
exchange without bus contention or waiting. However, as the number of processors
rises, the number of communication paths in an exhaustive architecture becomes
impractically large, leading to high costs. In addition, expansion of the network may
be limited by the inability of the existing processors to accept another communication
port. These difficulties with exhaustive communication architectures have led many
researchers to consider architectures based on limited communications.
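The growth in path count can be made concrete with a short sketch. It assumes one dedicated bidirectional path per pair of processors, so N processors need N(N-1)/2 paths and N-1 ports each; the function names are illustrative and do not come from the source.

```python
def exhaustive_paths(n: int) -> int:
    """Dedicated bidirectional paths needed to link every pair of n processors."""
    return n * (n - 1) // 2


def ports_per_processor(n: int) -> int:
    """Communication ports each processor must provide in an exhaustive network."""
    return n - 1


# Path count grows roughly as the square of the processor count.
for n in (4, 16, 64, 256):
    print(f"N={n:4d}  paths={exhaustive_paths(n):6d}  ports/processor={ports_per_processor(n)}")
```

Adding one processor to an existing N-processor network requires N new paths and one more port on every existing processor, which is the expansion difficulty noted above.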
2. Limited Communications 

In limited communication architectures, [Ref. 4] identifies two major groups: 
dedicated path and shared path structures. Limited architectures employing dedicated 
paths enable a processor to exchange data without bus contention or waiting, but only 
with a limited number of processors. Figures 1.2 and 1.3 show two examples of a
limited communication architecture employing dedicated paths.

Figure 1.2 Limited Communications--Dedicated Path
Loop.

Parallel-computing systems built around a limited communications-dedicated 
path concept can take advantage of the immediate communication between a given 
processor and the processors adjacent to it. Yet if a problem requires communication 
between non-adjacent processors, the message must be passed along by all the 
intermediate processors. Should the message reach a busy node, it may be delayed or
even discarded, forcing a re-transmission. The resultant communication overhead
could tie up the system and severely slow its operation.

Figure 1.3 Limited Communications--Dedicated Path
Regular Network.


Using a shared path (as in Figure 1.4) eliminates the need to relay data from 
one processor to another, because an uninterrupted path already exists between any 
two processors. For this reason, limited shared-path architectures are more flexible in 
the kinds of data flows which can be achieved and in the types of problems which can 
be solved than limited dedicated-path architectures. However, because processors must
wait their turn to use the common communication path, system throughput may suffer
unless the common bus runs at such a high speed that the processors can barely keep
up with it. Such a high-speed bus design would require a multiplexer on each chip
capable of speeds considerably in excess of those associated with conventional
multiplexers. The Optoelectronic Multiplexer (OM) developed by the Naval Ocean
Systems Center, San Diego, is such a device.


C. THE OPTOELECTRONIC MULTIPLEXER CONCEPT 
1. Optical Switching Yields High Speed 
Figure 1.4 Limited Communications--Shared Path.

The Optoelectronic Multiplexer employs optically-activated junctions to
sequentially link parallel data lines onto a serial bus. [Ref. 5] A laser pulse, fed to the
junction by optical fiber, activates the junction, allowing conduction from the input
line onto the main data transmission line. By using a different length of optical fiber
for each junction, the laser pulses will arrive at the junctions at different times.
Consequently, the junctions are activated one at a time, which converts the parallel
data waiting on the input lines to serial data pulses travelling along the output
transmission line. The short pulsewidths generated by the laser allow extremely high
pulse repetition frequencies--researchers have tested a prototype laser multiplexer at
speeds as high as 7 Gbps. [Ref. 5]
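The staggered fiber lengths can be illustrated with a small sketch that computes how much extra fiber each successive junction needs for its pulse to arrive exactly one bit period later. The refractive index of 1.5, the 0.10 m base length, and all function names are assumptions made for illustration; only the 7 Gbps rate comes from the text.

```python
C_VACUUM = 299_792_458.0   # speed of light in vacuum, m/s
FIBER_INDEX = 1.5          # assumed refractive index of the fiber (not from the source)


def fiber_length_step(bit_rate_bps, index=FIBER_INDEX):
    """Extra fiber length per junction so pulses arrive one bit period apart."""
    bit_period = 1.0 / bit_rate_bps
    return (C_VACUUM / index) * bit_period


def arrival_times(n_junctions, bit_rate_bps, base_length_m=0.10, index=FIBER_INDEX):
    """Arrival time of one laser pulse at each junction, in seconds."""
    step = fiber_length_step(bit_rate_bps, index)
    velocity = C_VACUUM / index
    return [(base_length_m + k * step) / velocity for k in range(n_junctions)]


step = fiber_length_step(7e9)
print(f"fiber length increment at 7 Gbps: {step * 100:.2f} cm")
for k, t in enumerate(arrival_times(4, 7e9)):
    print(f"junction {k}: pulse arrives at {t * 1e12:.1f} ps")
```

Under these assumptions the length increment is a few centimeters per junction, so the serialization order is fixed entirely by the fiber lengths and needs no electronic switching logic.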
2. A Suitable Architecture Sought 

Current research [Refs. 6 - 10] is especially rich in parallel-processing
architectures based on limited communication dedicated-path concepts, because shared
path communications typically involve delays which could detract from the high
performance otherwise achievable by parallel-processing designs. Prompted by the
development of the high-speed Optoelectronic Multiplexer, which promises an increase
in serial communication speed of at least one and perhaps two orders of magnitude,
this project evaluated the impact of using a shared bus and serial communication in a
parallel processing computer architecture. Specifically, the following questions were
posed: With current technology, is it feasible to fabricate an Optoelectronic
Multiplexer-based multiple processor chip? What new architectures are made possible
by the OM’s high speed? Which architecture makes optimum use of this new
capability?
Figure 1.5 Optoelectronic Multiplexer Block Diagram

Four conditions would have to be met in order for a single-chip OM-based
parallel processor to be feasible:

• IC manufacturing technology should be able to fabricate enough transistors on a
  single chip to create a multi-processor chip.

• A large chip partitioned into many processors would produce higher throughput
  than the same chip fabricated as a large uniprocessor.

• Chip throughput (measured in bits per second) would exceed the capacity of
  conventional multiplexers, justifying the use of the OM.

• The package of such a multiple processor chip would require so many pins that
  package size would be excessive and a multiplexer would be used instead.


The first condition is easily dealt with by a specific example. The Intel 8080
microprocessor contained about 4500 transistors [Ref. 12], while Motorola’s MC68020
contains about 200000 [Ref. 13]. Using the technology of the Motorola MC68020, one
could produce a chip with over 40 Intel 8080s. Clearly, manufacturers can already
fabricate a multiple-processor chip. The remaining points require further discussion
and are covered in Chapters II and III.
II. OPTIMUM ARCHITECTURE OF LARGE INTEGRATED CIRCUITS


Chapter I’s demonstration that a multiple-processor chip could be fabricated 
prompts the following questions: 


• Is a multiprocessor chip the best use of IC fabrication technology, or should all
  available transistors be assembled into a single processor?

• How large (in terms of transistor count, heat dissipation, and number of
  processors) would a chip have to be in order to justify the use of the
  Optoelectronic Multiplexer?


A. PARTITIONING SILICON FOR MAXIMUM THROUGHPUT

Should designers divide the available silicon among a few large and capable 
processors or among many less capable processors? Which mix yields the highest
throughput? 

Consider a system of N processors, each executing the same program and 
producing the same number of output data words each second. Applications of such 
architectures abound in the field of real time signal processing, which uses regularly 
structured algorithms. As N increases, processors share the load, so each may run 
more slowly without changing the speed of the system. If we imagine a system
throughput goal of R bits per second (bps), then:

R = N S    (eqn 2.1)

where R = System throughput (bps)
      N = Number of processors
      S = Throughput of each processor (bps).

Rearranging:

S_{req'd} = R N^{-1}    (eqn 2.2)

where S_{req'd} = Speed required of each processor
                  in order to meet the system goal of R bps.


These equations describe what is required of a processor--but how does a
processor’s actual performance vary with N? At issue is the apportionment of the
entire chip’s allotment of transistors and heat dissipation ability among N processors.


1. Transistor Constraints

Assuming we can put only so many devices on a chip, then:

t = T N^{-1}    (eqn 2.3)

where t = complexity of any processor, measured in transistors
      N = number of processors
      T = Total number of transistors on chip


Generally, a complex processor will be able to perform a given calculation 
faster than a simple processor. For example, a microprocessor with an on-board 
floating-point unit can handle a multiplication in a few clock cycles, while a smaller 
processor has to do tedious successive additions, requiring much more time. But what 
is the exact relationship between processor complexity and speed? To answer this we 
shall examine the specifications of some existing processors, as listed in Table I and
graphed in Figure 2.1.


TABLE I
SPECIFICATIONS OF SOME ACTUAL PROCESSORS
Reference Data Word Time Required for Bit Transistor 
(Bits) Multiplication Rate Count 
(107° sec) (10® sec™+) (thousand) 
Figure 2.1 Processor Speed and Complexity.


From the experimental relationships between processor speed and complexity
shown in Figure 2.1, we can see that the data in each group are approximated by the
equation:

S_{proc} = A t^{a}    (eqn 2.4)

where S_{proc} = processor speed (in bps throughput)
      t = processor complexity (in number of transistors)
      A = empirical constant of proportionality given in Table II
      a = empirical constant given in Table II.


TABLE II
EXPERIMENTAL CONSTANTS

Group        A               a
CPU 81-82    6.69 x 10^2     0.783
CPU 82-85    5.16 x 10^-3    2.07
FPU          4.22 x 10^5     0.711


Equation 2.4 describes how, in some typical one-processor systems, processor
speed is related to complexity. To apply these findings to an N-processor system of T
transistors, we combine equations 2.3 and 2.4:

S_{proc} = A (T N^{-1})^{a}    (eqn 2.5)
S_{proc} = A T^{a} N^{-a}
S_{proc} = K N^{-a}

where S_{proc} = processor speed (in bps throughput)
      t = processor complexity (in number of transistors)
      N = number of processors
      K = A T^{a}, a constant for a chip of T transistors
      A and a are constants given in Table II.
A family of “processor curves” may be used to describe the tradeoff between
individual processor speed and the number of processors, constrained by a constant
number of transistors. The tradeoffs are shown in Figure 2.2. For example, consider
the curve labeled “CPU 82-85,” which is based on a constant 10^6 transistors per chip.
If these transistors are divided into 10 processors of 10^5 transistors each, Equation 2.4
predicts that each will produce about 116 x 10^6 bps of output. But if the chip is
divided into more (for example 25) processors of 4 x 10^4 transistors each, then these
less complex processors will be capable of only about 17.3 x 10^6 bps each.
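The two figures quoted above can be checked with a short sketch of Equations 2.3 and 2.4. It assumes A = 5.16 x 10^-3 and a = 2.07 for the CPU 82-85 group; the exponent of A is an assumed reading of Table II, chosen because it reproduces the throughputs stated in the text. The function and variable names are illustrative.

```python
A_CPU_82_85 = 5.16e-3   # assumed reading of Table II (exponent inferred, see lead-in)
a_CPU_82_85 = 2.07      # Table II
T_CHIP = 1e6            # total transistors on the chip, as in the example


def processor_speed(n, total_transistors=T_CHIP, A=A_CPU_82_85, a=a_CPU_82_85):
    """Throughput (bps) of one processor when the chip is split into n processors."""
    t = total_transistors / n      # eqn 2.3: transistors per processor
    return A * t ** a              # eqn 2.4: speed as a power law in complexity


for n in (10, 25):
    print(f"N={n:2d}: {processor_speed(n) / 1e6:.1f} x 10^6 bps per processor")
```

With these constants the sketch gives values close to the 116 x 10^6 and 17.3 x 10^6 bps quoted in the paragraph above.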

When we superimpose these processor curves (Figure 2.2) with a family of
“system” curves, generated by choosing several values of “R” in Equation 2.2, the result
(Figures 2.3 and 2.4) yields a strategy for choosing N. Where the processor curve
(describing what the processor can do) intersects the system curve (describing what
each processor must do) determines the number of processors (N) into which the chip
should be divided to yield that particular level of system throughput. For example, to
achieve a system throughput of 10^9 bps, Figure 2.3 shows the chip should be divided
into about 12 processors (point A). Yet choosing to partition the silicon into fewer,
larger processors (point B) yields a higher system throughput of 2 x 10^9 bps.
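The intersection of a processor curve with a system curve can also be found algebraically: setting A(T/N)^a equal to R/N gives N^(1-a) = R / (A T^a). The sketch below solves this for N under the same assumed CPU 82-85 constants (A = 5.16 x 10^-3, a = 2.07, T = 10^6 transistors); the function name is hypothetical, and the result is a real number that a designer would round to a practical processor count.

```python
def processors_for_throughput(R, T=1e6, A=5.16e-3, a=2.07):
    """N at which the processor curve (eqn 2.5) meets the system curve (eqn 2.2).

    Solves N * A * (T / N)**a = R for N. Requires a != 1.
    """
    if a == 1.0:
        raise ValueError("the curves are parallel when a == 1")
    return (R / (A * T ** a)) ** (1.0 / (1.0 - a))


for R in (1e9, 2e9):
    n = processors_for_throughput(R)
    print(f"R = {R:.0e} bps -> N = {n:.1f} processors")
```

Because a > 1 for this group, a higher throughput goal corresponds to a smaller N, which matches the move from point A to point B described above.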

In general, when processor speed is a strong function of complexity, that is
when:

S_{proc} = A t^{a} with a > 1    (eqn 2.6)

then S_{proc} is proportional to N^{-a} (a > 1) while S_{req'd} is proportional to N^{-1}. Thus,
S_{proc} falls faster than S_{req'd} as N increases. In this case, the highest performance will
always result from choosing the lowest N possible, in other words N=1. This strategy
may be constrained for very large values of T--there may not be a processor design
which can effectively use 10^7 transistors, for example. Also, the optimistic relationship
of Equation 2.6 may not hold for large values of t.

On the other hand, when a weak relationship exists between speed and
complexity, as shown in Figure 2.4, the best strategy is to select N as large as possible.
As before, however, there are limits to this rule. It may be impractical to divide the
computational task beyond a certain point. For example, a 256-point FFT probably
can not be efficiently shared by more than 128 x 8 = 1024 processors.¹ Also, as N
increases and t decreases, processors will eventually become too simple to function as
microprocessors. For example, excessive reduction in processor complexity could yield
a circuit unable to retain a data word or perform a basic calculation.

Figure 2.4
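Both partitioning rules follow from one expression for total chip throughput. The short derivation below is a sketch that uses only Equations 2.1, 2.3 and 2.4; the symbol S_sys for total throughput is introduced here for convenience and is not the author's notation.

```latex
\begin{align*}
S_{sys} &= N \, S_{proc} = N \, A \left( T N^{-1} \right)^{a} = A \, T^{a} \, N^{1-a} \\
\frac{\partial S_{sys}}{\partial N} &= (1-a) \, A \, T^{a} \, N^{-a}
\end{align*}
```

The derivative is negative when a > 1, so throughput is highest at N = 1, and positive when a < 1, so throughput keeps rising as N grows until the practical limits described above intervene.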
2. Power Constraints 
Each chip can only dissipate a given amount of heat. The power available to
any individual processor is:

p = P N^{-1}    (eqn 2.7)

where p = Power available to any one processor
      N = number of processors
      P = Total power available to the chip


TABLE III
SPECIFICATIONS OF SOME ACTUAL PROCESSORS
¹There are 256 ÷ 2 = 128 processors per stage and log2(256) = 8 stages.
When the relationship between processor speed and power is examined in the light
of data from actual processors (Table III and Figure 2.5), no clear trend is evident.
In particular, there is a great deal of scatter in the CMOS multiplier
chip data. This may be due to differences in the way researchers report power
dissipation data; for example, some may report only the power consumption of the
computational segment, while others report the power used by the entire chip,
including bus drivers. In spite of these limitations, one interpretation of the
power/throughput data is:


S_{proc} = B p^{b}    (eqn 2.8)

where S_{proc} = processor speed (in bps throughput)
      p = processor power (in watts)
      B and b are empirical constants given in Table IV.


Therefore, combining equations 2.7 and 2.8 as before:

S_{proc} = B (P N^{-1})^{b}    (eqn 2.9)
S_{proc} = B P^{b} N^{-b}
S_{proc} = K N^{-b}

where S_{proc} = processor speed (in bps throughput)
      p = processor power (in watts)
      N = number of processors
      K = B P^{b}, a constant for a chip dissipating P watts
      B and b are empirical constants given in Table IV.
Figure 2.6 shows the relationship described in equation 2.9, namely, the
tradeoff of individual processor speed against the number of processors, constrained
this time by a constant power level, as required by equation 2.7. Since, for the group
of actual processors examined,

S_{proc} = B p^{b} with b < 1    (eqn 2.10)

Figure 2.7 shows that the best strategy is to select N as large as possible.


Figure 2.5 Processor Speed and Power (Experimental).

TABLE IV
EXPERIMENTAL CONSTANTS

Group
NMOS CPUs
CMOS FPUs


B. MINIMUM CHIP SIZE FOR OM APPLICATION 
How large (in terms of transistor count, heat dissipation, and number of 
processors) would a chip have to be in order to produce sufficient throughput to justify 
the use of the Optoelectronic Multiplexer? 
1. Minimum Transistor Count 
Assuming the individual processors are of low complexity (like the FPU group
of Figure 2.1) implies that:

S_{proc} = A t^{a} with a < 1    (eqn 2.11)

where S_{proc} = processor speed (in bps throughput)
      t = processor complexity (in number of transistors)
      A = empirical constant of proportionality given in Table II
      a = empirical constant, here < 1.


For this group, the discussion in the previous section shows that the
maximum throughput is achieved by partitioning the available silicon into the largest
number of processors possible, limited by the minimum complexity of the simplest
processor design.² Therefore:

N_{max} = T t_{min}^{-1}    (eqn 2.12)

where T = total number of transistors on chip
      t_{min} = complexity of the simplest processor design, measured in transistors
      N_{max} = number of simple processors possible on a chip of T transistors

²While the components of systolic arrays are less complex than the assumed
simplest processor, this research did not study the performance of such
ICs--accordingly they are not considered here.
Since each processor produces an output of S_{proc} bits per second and there
are N_{max} processors, the system throughput is:

S_{sysmax} = N_{max} S_{proc}    (eqn 2.13)

           = N_{max} A t_{min}^{a}    (eqn 2.14)

           = (T t_{min}^{-1}) A t_{min}^{a}    (eqn 2.15)

           = T A t_{min}^{a-1}    (eqn 2.16)


Defining S_{o} to be the minimum system throughput for which use of the OM
is justified leads to:

S_{o} = T_{min} A t_{min}^{a-1}    (eqn 2.17)

T_{min} = S_{o} t_{min}^{1-a} A^{-1}    (eqn 2.18)

where T_{min} = minimum number of transistors on chip for OM usage to be justified


To estimate the value of T_{min}, assume:

t_{min} ≈ 7 x 10^3 transistors (lower end of FPU group in Table I)
A = 4.22 x 10^5 (Table II)
a = 0.711 (Table II)
S_{o} = 3 x 10^9 bps (Currently the upper range of
        conventional multiplexers.) [Refs. 34,35,36,37]


Therefore:

T_{min} = 92000 transistors
N_{max} = 13 processors

Thus, since processors with transistor counts > T_{min} are already in existence
[Ref. 21], it seems that an OM-based single chip multiple processor is feasible with
respect to the number of transistors required.
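The estimate can be reproduced with a sketch of Equations 2.12 and 2.18. The inputs are assumed readings of partly illegible values in the source: S_o = 3 x 10^9 bps, A = 4.22 x 10^5, a = 0.711, and a simplest-processor complexity of about 7 x 10^3 transistors, which is the value implied by the stated results of 92000 transistors and 13 processors. The function names are illustrative.

```python
def min_transistors(S_o, t_min, A, a):
    """Eqn 2.18: chip transistor count at which throughput reaches S_o."""
    return S_o * t_min ** (1.0 - a) / A


def max_processors(T, t_min):
    """Eqn 2.12: number of simplest processors that fit in T transistors."""
    return T / t_min


S_o = 3e9       # bps, assumed upper range of conventional multiplexers
t_min = 7.0e3   # transistors per simplest processor (assumed reading)
A, a = 4.22e5, 0.711   # FPU group constants (exponent of A is an assumed reading)

T_min = min_transistors(S_o, t_min, A, a)
print(f"T_min ~ {T_min:,.0f} transistors")
print(f"N_max ~ {max_processors(T_min, t_min):.1f} processors")
```

Under these assumptions the sketch lands close to the 92000 transistors and 13 processors reported above.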
2. Minimum Power Dissipation 
What is the minimum heat dissipation of a multi-processor chip which would
yield throughput in the OM range?


N_{max} = P p_{min}^{-1}    (eqn 2.19)

where P = Total power dissipation of the chip (watts)
      p_{min} = power used by the simplest processor design, measured in watts
      N_{max} = number of simple processors possible on a chip of P watts

S_{sysmax} = N_{max} S_{proc}    (eqn 2.20)

Substituting from Equation 2.8,

S_{sysmax} = N_{max} B p_{min}^{b}    (eqn 2.21)

And, substituting for N_{max} from Equation 2.19,

S_{sysmax} = (P p_{min}^{-1}) B p_{min}^{b}    (eqn 2.22)

S_{sysmax} = P B p_{min}^{b-1}    (eqn 2.23)

Defining S_{o} to be the minimum system throughput for which use of the OM
is justified leads to:

S_{o} = P_{min} B p_{min}^{b-1}    (eqn 2.24)

P_{min} = S_{o} p_{min}^{1-b} B^{-1}    (eqn 2.25)

where P_{min} = minimum power dissipation of the chip for OM usage to be justified
To estimate the value of P_{min}, assume:

p_{min} = 0.10 watts (lower end of CMOS FPU group in Table III)
B = 3.43 x 10^8 (Table IV)
b = 0.099 (Table IV)
S_{o} = 3 x 10^9 bps

Therefore:

P_{min} = 1.1 watts
N_{max} = 11 processors


This power level is quite reasonable, and it would seem that from the
standpoint of heat dissipation an OM-based multiple processor chip is feasible.
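The power estimate follows the same pattern as the transistor estimate. The sketch below evaluates Equations 2.19 and 2.25 with assumed readings of the partly illegible source values: p_min = 0.10 W, B = 3.43 x 10^8, b = 0.099 and S_o = 3 x 10^9 bps. These readings are mutually consistent with a result near 1.1 W and 11 processors, but they remain assumptions; the function names are illustrative.

```python
def min_chip_power(S_o, p_min, B, b):
    """Eqn 2.25: chip power (watts) at which throughput reaches S_o."""
    return S_o * p_min ** (1.0 - b) / B


def max_processors_by_power(P, p_min):
    """Eqn 2.19: number of simplest processors a power budget P can supply."""
    return P / p_min


P_min = min_chip_power(S_o=3e9, p_min=0.10, B=3.43e8, b=0.099)
print(f"P_min ~ {P_min:.2f} W")
print(f"N_max ~ {max_processors_by_power(P_min, 0.10):.1f} processors")
```

Because b is much smaller than 1, throughput per watt improves sharply as processors become simpler, which is why the required chip power comes out so modest.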


III. THE NEED FOR A HIGH-SPEED MULTIPLEXER


Chapter II demonstrated that current technology could produce a chip whose 
throughput would exceed the capacity of conventional multiplexer technology. But 
why consider serial communications and multiplexers at all? Why not exchange data
with the chip in parallel via pins or leads?


A. PROCESSOR POWER LIMITED BY COMMUNICATION PATH 

We have seen that future high-density ICs may be optimally structured as a
bank of many processors, each of moderate capability. However, even if 
manufacturers can achieve sufficient circuit density to fabricate a multi-processor chip, 
such a device might not be practical due to the large number of leads needed to 
communicate with each processor from off-chip. For example, imagine an N-processor 
IC designed to compute a 2N-point Fast Fourier Transform (FFT). During the 
computation, the IC must read in, then write out, 2N complex words, or 4N
real words. Assuming a 40 bit word size, and using the same pins for input and
output, we can see this IC would need:

(40 leads / word) x [4N words] = 160N leads    (eqn 3.1)

How large a package will we need to handle all these leads? Using a Pin-Grid
Array (PGA) package with pins spaced every 0.1 inch, the area of the package is:

Area_{PGA} = [160N leads] / (10 leads/inch)^2 = 1032N mm^2    (eqn 3.2)


For illustrative purposes we can estimate the area of the silicon chip in this
package by assuming the chip size of the processor is approximately the same as that
of the processor recently reported by the Matsushita Corporation of Osaka. [Ref. 28]
Their processor performs a 32 bit floating point multiplication in about 75 nsec and is
32.6 mm^2 in area. A chip containing N of these processors would occupy about 32.6N
mm^2 of silicon. Thus, the ratio of silicon area to package area in our hypothetical IC
is:

(Silicon Area) / (Package Area) = [32.6N] / [1032N] = 3.2 %    (eqn 3.3)
As IC fabrication technology improves, this waste of space gets even worse. A 
new production technique enabling manufacturers to produce circuits in half the silicon 
area previously required would permit us to double “N” without increasing the silicon 
area. Yet package area would double, due to increased pinout requirements. Once 
some maximum package size is reached, further improvements in circuit density do us
no good--we simply can not communicate with more processors. As one researcher
stated, “the technology has become increasingly constrained by packaging limitations”
[Ref. 38].

Increasing lead density will produce some relief from this communication limit,
but can not be pursued beyond some maximum without excessive fabrication cost. We 
are faced, then, with some maximum package size and maximum lead density, implying 
an eventual limit on the number of leads a single IC can have. 

Given this eventual limit on the number of simultaneous off-chip communication 
paths, Rent’s Rule [Ref. 12:p. 235] 


P ∝ G^{r}    (eqn 3.4)

where P = Number of chip pads or leads
      G = Number of gates on the chip
      r = an empirical exponent less than one

would seem to imply that if the number of paths (P) is limited, then so is the number
of gates (G) and, therefore, microprocessor complexity and computational power.

This ultimate limit on non-multiplexed designs is not precisely defined. Neither
maximum package size nor maximum lead density has yet been reached, and industry
experts are wary of predicting when they might be. In addition, the switch to
multiplexed designs will probably occur over a range of processor densities and
complexities, influenced by market factors (there will be few customers for very large
packages) and manufacturing realities (specialized chip sizes mean more expensive chip
handling equipment) as well as the theoretical factors described above.

For all these reasons, large ICs composed of multiple processors will require too 
many pins to use a conventional parallel-transfer scheme with pins or leads. Instead a 
serial communications link must be considered, and as shown in Chapter II, the speeds
required will exceed the capacity of conventional multiplexers.


IV. SYSTEM ARCHITECTURE BASED ON SERIAL COMMUNICATION 


Chapters II and III demonstrate that, in the next generation of ICs, a
microprocessor may very well be organized as a bank of smaller processors, all sharing 
a relatively few pins through a high-speed multiplexer. But: 


• What on-chip data flow architecture should be employed among these
  processors?

• How can a serial data stream be distributed among N processors?

• What are the detailed structures of the elements which make up an OM-based
  architecture?


A. ON-CHIP DATA FLOW ARCHITECTURE 

How should an N-processor chip be organized? The ideal structure will vary with
the application; this discussion considers one specific application--computing FFTs.
The number of processors required to compute a given size FFT will depend on
whether processors are “reused,” that is whether a processor bank’s outputs are
shuffled and returned to the same processors (reused) or directed to the next bank of
processors (pipelined). Reusing processors allows a given FFT to be computed with
fewer processors, but takes more time. The architectures associated with both reuse
and pipeline strategies are discussed in the following sections.

1. Pipeline Architecture 

Assuming that the throughput of the system is to be maximized, there will be
no “reuse” of processors. That is:

• each processor performs only a two point FFT “butterfly”

• a new bank of processors performs each stage of the computation in a pipeline
  strategy.

The most straightforward architecture for N processors is an N X 1 column.
How would this grouping affect data flow among the processors? As an example,
when the task is a 16 point FFT, the processors must exchange data as shown in
Figures 4.1 and 4.2. Dividing up the 32 processors shown in Figure 4.2 into 4 X 1
chips forces 80 data words to cross chip boundaries during the computation, as shown
in Figure 4.3.

Re-organizing the four processors on each chip into a 2 X 2 matrix (Figure
4.4) results in only 48 words crossing chip boundaries. This lessens off-chip
communication delay, which improves the system’s throughput, and reduces the demands
on the communications network.

The 2 X 2 structure is more efficient because it is the structure of a four point
FFT. In a sense, the 2 X 2 structure performs all the computations possible on the
four points it receives, while the 4 X 1 array, receiving eight points, must hand off its
data only partially “chewed.”

There are many such matrices, each corresponding to a particular FFT. For
example, Figure 4.2 suggests that a 32-processor IC designed for FFT
computation would best be configured as an 8 X 4 matrix. In general, the matrix
dimensions are:

2^{m-1} X m, where m = 1, 2, 3, ...    (eqn 4.1)


2. Reuse Architecture 


The number of processors required by a pipeline architecture to compute a
P-point FFT is:

(P/2) log2 P    (eqn 4.2)
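The processor counts used in this chapter can be tabulated with a short sketch. It assumes the pipeline count is (P/2) log2 P, one two-point butterfly processor per pair of points per stage, and that the on-chip matrix for m stages is 2^(m-1) rows by m columns; both forms agree with the 32 processors of Figure 4.2 and the 128 x 8 figure quoted for a 256-point FFT. The function names are illustrative.

```python
def pipeline_processors(points: int) -> int:
    """Butterfly processors needed to pipeline a P-point FFT: (P/2) * log2(P)."""
    stages = points.bit_length() - 1           # log2 for an exact power of two
    if points < 2 or 2 ** stages != points:
        raise ValueError("points must be a power of two, at least 2")
    return (points // 2) * stages


def matrix_dimensions(m: int) -> tuple:
    """Rows x columns of the processor matrix that computes a 2**m-point FFT."""
    return (2 ** (m - 1), m)


for p in (4, 16, 256):
    m = p.bit_length() - 1
    rows, cols = matrix_dimensions(m)
    print(f"{p:3d}-point FFT: {rows} X {cols} matrix, {pipeline_processors(p)} processors")
```

The rapid growth of this count with P is what motivates the reuse architecture discussed next.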


This number of processors may prove to be impractical or simply too
expensive, or we may not need the ultimate throughput achievable by the pipeline
architecture, yet still need more throughput than that provided by a uniprocessor.
Also, it may be desirable to adapt an existing pipeline system to compute larger
FFTs--without adding processors. In each of these cases, reusing processors in the
computation enables the designer to trade off system throughput for design complexity
and cost. How are data exchanged among processors in a reuse architecture?

The computations shown in Figure 4.2 still must be performed, but now 
instead of each block representing an actual processor, it represents a job that some 
processor will have to perform. For example, consider a 16-point FFT performed with 
two 4-processor ICs. Figure 4.5 shows the data exchange for this example if the 
four-processor ICs are organized as 4 X 1 vectors. 

As shown in Table V, even though a chip processes eight points every frame,
it only transmits four points per frame--keeping half its data onboard for further
processing with the half it will receive from the other IC. This assumes that some
on-chip communication path exists to enable processors to exchange data.
Figure 4.2

Figure 4.5 Sixteen Point Fast Fourier Transform
4 X 1 Reuse Architecture.
As an alternative, consider the same 16-point FFT computed by two
4-processor ICs, this time organized as 2 X 2 matrices, as shown in Figure 4.6 and
Table VI.

Because each IC is only two processors “wide,” a single IC can only accept
four data points at a time. This creates an awkward data flow--the source delivers only
half the input vector, waits, then delivers the other half. Each chip must store the
output of its first computation while processing the second half of the input vector.
However, the number of data points exchanged between chips is sharply reduced from
24 for the 4 X 1 case to 8 for the 2 X 2 case.
Figure 4.6 Sixteen Point Fast Fourier Transform
2 X 2 Reuse Architecture.
A third organization of these four processors permits transmission of the
entire data vector (as in the 4 X 1 chip) and minimizes the data exchange (as in the 2
X 2 chip). Its structure is shown in Figure 4.7 and Table VII.

This structure, possible only if processors are reused, maximizes the “width” of 
the chip while preserving the communication advantages of a “deep” chip. As 
discussed in the previous section, these advantages stem from performing all the 
calculations possible on a given data set before releasing it to another chip. By not
allowing “partially chewed” data off the chip, the number of data to be exchanged
between chips at each stage is minimized. In general, an N-processor chip with this
reuse architecture can perform a 2N-point FFT if organized as an N X 1 vector which
performs 1 + log2 N stages.

3. Interleaving Data Sets 

The efficiency and throughput of any of these reuse architectures can be
improved through interleaving data sets--that is, delivering new data to the processors
to work on while they wait for the communications link to recycle their intermediate
outputs back to their inputs. Consider the progress of a 16-point FFT calculation
performed by eight processors organized as in Figure 4.7. The processor wait time is
clearly evident in Figure 4.8, in which the data sets are not interleaved. In this
example, the throughput is one FFT per (4T_{calc} + T_{xfer}).

Figure 4.7 Sixteen Point Fast Fourier Transform
Modified 4 X 1 Reuse Architecture.

In Figure 4.9, however, a new data set is delivered to the processors while they
wait for the results of the first phase of the calculation to be recirculated. In this
interleaved case, processors are never allowed to be idle. For this example, throughput
is 2 FFTs per 8 Tcalc, or 1 per 4 Tcalc--slightly higher than in the non-interleaved case.
This improvement in throughput was achieved without an increase in bus speed;
alternatively, one could reduce bus speed requirements without lowering throughput by
incorporating an interleaved reuse architecture.
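
The two throughput figures can be compared numerically. The sketch below assumes that the two time constants in the text are the per-stage calculation time (`t_calc`) and the time to recirculate intermediate results (`t_xfer`); the function names and the sample values are illustrative only.

```python
def throughput_non_interleaved(t_calc: float, t_xfer: float) -> float:
    """One FFT per (4 * t_calc + t_xfer): processors idle during the transfer."""
    return 1.0 / (4 * t_calc + t_xfer)


def throughput_interleaved(t_calc: float) -> float:
    """Two FFTs per 8 * t_calc: the transfer is hidden behind useful work."""
    return 2.0 / (8 * t_calc)


# Hypothetical timings in arbitrary units.
t_calc, t_xfer = 1.0, 0.5
plain = throughput_non_interleaved(t_calc, t_xfer)
mixed = throughput_interleaved(t_calc)
print(f"non-interleaved: {plain:.4f} FFTs per unit time")
print(f"interleaved:     {mixed:.4f} FFTs per unit time")
print(f"gain:            {mixed / plain:.3f}x")
```

The gain is (4 * t_calc + t_xfer) / (4 * t_calc), so it grows as the transfer time becomes a larger share of the cycle.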


B. DATA DISTRIBUTION 
Data delivery to the processors can be accomplished several ways: 
• processing elements all receive the same data in broadcast fashion
• processors “know” when it’s their turn to receive data and they query the BIU
• data words are “tagged” with their destination--RBIU reads the tag and delivers
data words to their intended processor
• processors contend for bus access with each other
• RBIU delivers data to processors in a preset schedule.

Only this last scheme (using a preset schedule) promises to have sufficient speed 
to be acceptable for use with the OM. But is it possible to use an a priori schedule, 
and what would it look like? 

1. Pipeline Architecture 

Returning to the example of an FFT computer built of N-processor ICs, Figure
4.4 shows the data exchanges required by a sixteen point FFT if a pipeline architecture
is used.

The sequence of data on the bus is essentially arbitrary. In choosing the
sequence, it is reasonable to avoid sequences which deliver several data words to the
same BIU one right after the other, in order to minimize the speed required of the BIU.
Figure 4.10 shows one suitable choice.

Due to the regular structure of the FFT, there is a simple algorithm to
calculate the address of any data word’s destination, based only on its position in the
data stream, as shown in Table VIII. Because of this, the RBIU’s data distribution
logic can be implemented with little more than a binary counter. The transmission
algorithm is equally uncomplicated, as shown in Table IX.
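
To illustrate how a counter alone can drive distribution, the sketch below splits a word counter into a chip number and a slot number. The mapping is an assumed round-robin chosen for illustration; it is not the mapping of Table VIII, and the function name and parameters are hypothetical.

```python
def destination(position: int, n_chips: int = 4,
                words_per_chip: int = 4) -> tuple[int, int]:
    """Map a word's position in the data stream to (chip, slot).

    Assumed scheme: the low-order counter bits select the chip and the
    high-order bits select the slot, so consecutive words always go to
    different chips.
    """
    count = position % (n_chips * words_per_chip)  # counter wraps every frame
    chip = count % n_chips    # low-order bits of the counter
    slot = count // n_chips   # high-order bits of the counter
    return chip, slot


schedule = [destination(p) for p in range(16)]
for p, (chip, slot) in enumerate(schedule):
    print(f"word {p:2d} -> chip {chip}, slot {slot}")
```

Because the destination is a pure function of the count, no tags or bus arbitration are needed, and no BIU receives two words back to back.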

The fact that inter-stage data exchange patterns in the FFT computation are
regular and easily implemented in hardware lends further support to the use of preset
schedules to control BIU data distribution.
Figure 4.10 Sixteen Point Fast Fourier Transform
Distributed Among Four Multi-Processor Chips.
2. Reuse Architecture

Tables X and XI and Figure 4.11 show the data flow structure required to
compute a 16-point FFT with a reuse architecture. Although the task is accomplished
with fewer processors than in Figure 4.10, there are three additional complications:

• additional buffers directly connect processors which must exchange data in
intermediate stages of the calculation
• an internal path exists between TBIU and RBIU to allow processors which are
not directly connected to exchange data
• BIUs must coordinate the use of internal and external paths.
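
The coordination problem in the last item can be pictured as a three-way routing decision. The sketch below is illustrative only: the function name, the processor numbering, and the set of direct links are assumptions, not the layout of Figure 4.11.

```python
def select_path(src: int, dst: int, chip_of: dict[int, int],
                direct_links: set[frozenset[int]]) -> str:
    """Choose how a word travels from processor src to processor dst."""
    if frozenset((src, dst)) in direct_links:
        return "direct buffer"               # dedicated buffer between neighbors
    if chip_of[src] == chip_of[dst]:
        return "internal TBIU-to-RBIU path"  # same chip, not directly wired
    return "external bus"                    # different chips


# Hypothetical layout: processors 0-3 on chip 0, processors 4-7 on chip 1.
chip_of = {p: p // 4 for p in range(8)}
direct_links = {frozenset((0, 1)), frozenset((2, 3))}

for src, dst in [(0, 1), (0, 2), (0, 5)]:
    print(f"{src} -> {dst}: {select_path(src, dst, chip_of, direct_links)}")
```

In hardware this choice would be fixed by the preset schedule rather than computed at run time; the function only makes the three cases explicit.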
C. RECEIVER TASKS
We can view the data distribution circuitry as being separated into a Receiving
Bus Interface Unit (RBIU) and a Transmitting Bus Interface Unit (TBIU). The RBIU
must:
• capture data from the high-speed bus
• convert data from serial to parallel format
• perform error detection/correction
• deliver the data word to its destination processor.


Figure 4.12 shows the architecture developed in this project to accomplish these
tasks. Note that this architecture uses a separately distributed clock signal.
This scheme was used to simplify the construction and testing of a system prototype,
but once past this phase the clock could be embedded in the data stream itself (as in
Manchester coding), eliminating the need for a separate clock line. Alternatively, if a
fiber optic data link were used, the clock could be sent on the same fiber as the data,
but at a different carrier frequency (color), allowing clock recovery independently of
data reception.
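
As a reminder of how Manchester coding embeds the clock, the sketch below encodes each bit as two half-bit levels with a guaranteed mid-bit transition. It assumes the IEEE 802.3 convention (0 as high-then-low, 1 as low-then-high); the opposite convention is equally valid, and the function names are hypothetical.

```python
def manchester_encode(bits: list[int]) -> list[int]:
    """Encode bits as half-bit levels: 0 -> (1, 0), 1 -> (0, 1)."""
    halves = []
    for bit in bits:
        halves.extend((0, 1) if bit else (1, 0))
    return halves


def manchester_decode(halves: list[int]) -> list[int]:
    """Recover bits from half-bit levels, rejecting cells with no transition."""
    if len(halves) % 2:
        raise ValueError("expected an even number of half-bit levels")
    bits = []
    for first, second in zip(halves[0::2], halves[1::2]):
        if first == second:
            raise ValueError("missing mid-bit transition")
        bits.append(1 if (first, second) == (0, 1) else 0)
    return bits


data = [1, 0, 1, 1, 0]
line = manchester_encode(data)
print(line)
print(manchester_decode(line) == data)
```

Every bit cell contains a transition, which is what lets a receiver recover the clock from the data line. The cost is that the line must switch at twice the bit rate.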

The control signals shown in Figure 4.12 also deserve some discussion. The
RBIU circuitry develops these signals as a function of the bit count, then distributes
the signals depending on which word is currently being received. These signals control
the First-In-First-Out (FIFO) stacks which buffer data between the RBIU and the
Processing Elements (P/E), as well as between the P/Es and the TBIU. These FIFOs
require signals to cause them to:
• load a new data word (from the RBIU)
• output the next word (to the TBIU)
• advance the stack to bring up the next output word (now that the TBIU has the
current word)


D. TRANSMITTER TASKS 
The transmission part of the data distribution circuitry must: 
• take the data word from its source processor
• convert data from parallel to serial format
• add error detection bits
• insert data onto the high-speed bus
Figure 4.11
Figure 4.12 Receiving Bus Interface Unit Architecture.
Figure 4.13 Transmitting Bus Interface Unit Architecture.


Figure 4.13 shows the architecture developed in this project to accomplish these
tasks. The control and timing circuitry needed to interface the output FIFOs with the
TBIU is included as part of the RBIU diagram.
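
The transmitter and receiver tasks mirror each other, which a short round trip makes concrete. The sketch assumes a 16-bit word, most-significant bit first, and a single even-parity bit as the error detection scheme. None of these details is specified in the text, and the function names are hypothetical.

```python
WORD_BITS = 16  # assumed word width


def tbiu_serialize(word: int) -> list[int]:
    """Parallel-to-serial conversion, MSB first, with an even-parity bit appended."""
    bits = [(word >> i) & 1 for i in range(WORD_BITS - 1, -1, -1)]
    parity = sum(bits) % 2  # makes the total number of ones even
    return bits + [parity]


def rbiu_deserialize(frame: list[int]) -> int:
    """Serial-to-parallel conversion with a parity check."""
    if len(frame) != WORD_BITS + 1:
        raise ValueError("wrong frame length")
    *bits, parity = frame
    if sum(bits) % 2 != parity:
        raise ValueError("parity error")
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word


frame = tbiu_serialize(0xBEEF)
print(frame)
print(hex(rbiu_deserialize(frame)))
```

A single parity bit only detects an odd number of bit errors. Correction, which the receiver task list also mentions, would need a stronger code.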


E. CONCLUSIONS AND LIMITATIONS OF THIS RESEARCH 
1. Conclusions 

The high-speed serial communication provided by the Optoelectronic
Multiplexer makes possible a shared-bus parallel processing architecture for problems
like the FFT where the data distribution schedule can be determined a priori. The data
distribution algorithms for the FFT are quite simple and can be realized with little
more than a binary counter.

For the FFT, processor groupings on a chip should correspond to the 2^(n-1) X n
matrices inherent in the FFT calculation in order to minimize the amount of inter-chip
communication.

Trends in actual processor data suggest that the throughput of the processor
in most cases is proportional to [size of the processor]^α, where α < 1 and “size”
refers to both transistor count and power dissipation. This implies that, for a given
chip size, dividing the chip into increasing numbers of smaller processors raises the
number of processors faster than it lowers the throughput of an individual processor.
Thus, for most types of processors studied, the greatest throughput is achieved by
organizing a large chip as a bank of many simple processors.
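
The scaling argument can be made concrete with a small calculation. The sketch assumes that per-processor throughput is exactly (size)^alpha with a proportionality constant of 1; the exponent value 0.5 is purely illustrative, and the function name is hypothetical.

```python
def chip_throughput(n_processors: int, chip_size: float = 1.0,
                    alpha: float = 0.5) -> float:
    """Total throughput of a chip split into n equal processors.

    Each processor has size chip_size / n and throughput size ** alpha,
    so the total is chip_size ** alpha * n ** (1 - alpha).
    """
    per_processor = (chip_size / n_processors) ** alpha
    return n_processors * per_processor


for n in (1, 4, 16, 64):
    print(f"{n:2d} processors: relative throughput {chip_throughput(n):.2f}")
```

For any exponent below 1 the total grows as n^(1 - alpha). The model ignores fixed per-processor overhead such as interface circuitry, which would eventually limit the gain.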

Finally, a single-chip OM-based parallel processor is feasible since:

• Manufacturers can fabricate sufficient transistors on a single chip to construct
many simple processors.
• A chip composed of only about 12 simple processors, easily achieved with current
fabrication technology, could produce enough throughput in a highly structured
problem (like the FFT) to justify the use of the OM’s high capacity.
• Constructing such a chip in a conventional package using one pin or lead per bit
would require an excessive package size.
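
A rough pin count shows why the one-pin-per-bit package is impractical. The sketch assumes each processor has its own parallel input port and output port and a 16-bit word; both figures are assumptions for illustration, and power, ground, and control pins are ignored.

```python
def parallel_data_pins(n_processors: int, word_bits: int,
                       ports_per_processor: int = 2) -> int:
    """Data pins needed when every bit of every port gets its own pin."""
    return n_processors * word_bits * ports_per_processor


# 12 processors as in the text; a 16-bit word is an assumed value.
print(parallel_data_pins(12, 16))  # 12 * 16 * 2 = 384 data pins
```

A serially multiplexed package replaces all of these with a single link per direction, at the price of a much higher line rate.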


2. Limitations and Recommendations 

The architecture described in this report was designed with only the FFT in 
mind. It may not be adaptable to less structured calculations or to systems which 
must perform a wide variety of calculations. 

Multiple-processor chip performance was predicted based on a limited
sampling of current processor data. Further research, using a comprehensive study of
actual processor performance, is needed to augment the simple model developed here.

The comparison of conventional leaded packages and serially multiplexed
packages considered only the extremes of one pin per bit and one pin per chip.
Additional study of alternatives between these endpoints is needed to determine at
what point the cost (in terms of dollars, chip area, and heat) of the Optoelectronic
Multiplexer is justified by its higher performance.